Finding Frequent Substructures in Chemical Compounds

Authors

  • Luc Dehaspe
  • Hannu Toivonen
  • Ross D. King
Abstract

The discovery of the relationships between chemical structure and biological function is central to biological science and medicine. In this paper we apply data mining to the problem of predicting chemical carcinogenicity. This toxicology application was launched at IJCAI’97 as a research challenge for artificial intelligence. Our approach to the problem is descriptive rather than based on classification: the goal is to find common substructures and properties in chemical compounds, and in this way to contribute to scientific insight. This approach contrasts with previous machine learning research on this problem, which has mainly concentrated on predicting the toxicity of unknown chemicals. Our contribution to the field of data mining is the ability to discover useful frequent patterns that are beyond the complexity of association rules or their known variants. This is vital to the problem, which requires the discovery of patterns that are out of the reach of simple transformations to frequent itemsets. We present a knowledge discovery method for structured data, where patterns reflect the one-to-many and many-to-many relationships of several tables. Background knowledge, represented in a uniform manner in some of the tables, has an essential role here, unlike in most data mining settings for the discovery of frequent patterns.
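
To make the idea of a frequent pattern spanning several tables concrete, here is a minimal Python sketch. It is not the authors' method; it only illustrates, under stated assumptions, why a pattern such as "the compound contains an atom of element X bonded to an atom of element Y" needs both an atom table and a bond table and so does not reduce to a flat itemset. The tables `ATOMS` and `BONDS`, the compound identifiers, and all function names are hypothetical toy constructs.

```python
# Hypothetical toy data: two relations, both keyed by compound id.
# ATOMS is one-to-many (compound -> atoms); BONDS links pairs of atoms.
ATOMS = {
    "c1": {"a1": "c", "a2": "n", "a3": "o"},
    "c2": {"a1": "c", "a2": "n"},
    "c3": {"a1": "c", "a2": "o"},
}
BONDS = {
    "c1": {("a1", "a2"), ("a2", "a3")},
    "c2": {("a1", "a2")},
    "c3": {("a1", "a2")},
}


def has_bonded_pair(compound, elem_x, elem_y):
    """True if some bond in the compound joins elem_x to elem_y."""
    atoms = ATOMS[compound]
    return any(
        {atoms[u], atoms[v]} == {elem_x, elem_y}
        for u, v in BONDS[compound]
    )


def frequency(elem_x, elem_y):
    """Fraction of compounds in which the bonded-pair pattern occurs."""
    hits = sum(has_bonded_pair(c, elem_x, elem_y) for c in ATOMS)
    return hits / len(ATOMS)


def frequent_bonded_pairs(min_freq):
    """All unordered element pairs whose frequency reaches min_freq."""
    elements = sorted({e for atoms in ATOMS.values() for e in atoms.values()})
    result = {}
    for i, x in enumerate(elements):
        for y in elements[i:]:
            f = frequency(x, y)
            if f >= min_freq:
                result[(x, y)] = f
    return result


if __name__ == "__main__":
    print(frequent_bonded_pairs(0.5))
```

In this toy data only the carbon-nitrogen bond occurs in at least half of the compounds. Frequency is counted per compound rather than per bond, so a compound with many matching bonds still contributes once.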


Similar resources

Warmr: a data mining tool for chemical data

Data mining techniques are becoming increasingly important in chemistry as databases become too large to examine manually. Data mining methods from the field of Inductive Logic Programming (ILP) have potential advantages for structural chemical data. In this paper we present Warmr, the first ILP data mining algorithm to be applied to chemoinformatic data. We illustrate the value of Warmr by app...


Improving Drug Discovery Process by Identifying Frequent Toxic Substructures in Chemical Compounds - A Graph Mining Approach

Discovery of drug using computer modeling is one of the major challenges in contemporary medicine. Developing new therapeutic drugs is an expensive and time consuming process. Toxicity, caused by substructures that are carcinogenic in nature, is one of the important aspects that need to be explored during drug discovery. The use of in silico methods at an early stage of drug discovery can great...


Mining Significant Chemical Substructures in 2D and 3D Spaces

We present scalable and efficient techniques to mine molecular repositories based on their representation in the 2D and 3D spaces. Both techniques show immense potential in molecular classification and activity prediction. Empirical evaluation has confirmed the potential of these techniques. Increased availability of large repositories of chemical compounds has created new challenges and opport...


Exploring the Limits of Graph Invariant- and Spectrum-Based Discrimination of (Sub)structures

The limits of a recently proposed computer method for finding all distinct substructures of a chemical structure are systematically explored within comprehensive graph samples which serve as supersets of the graphs corresponding to saturated hydrocarbons, both acyclic (up to n = 20) and (poly)cyclic (up to n = 10). Several pairs of smallest graphs and compounds are identified that cannot be dis...


Kernel-based Similarity Search in Massive Graph Databases with Wavelet Trees

Similarity search in databases of labeled graphs is a fundamental task in managing graph data such as XML, chemical compounds and social networks. Typically, a graph is decomposed to a set of substructures (e.g., paths, trees and subgraphs) and a similarity measure is defined via the number of common substructures. Using the representation, graphs can be stored in a document database by regardi...



Publication date: 1998